Add tdbg schedule audit command #10376

Open
fretz12 wants to merge 3 commits into temporalio:main from fretz12:fredtzeng/tdbg-schedule-audit

@fretz12 fretz12 commented May 24, 2026

What changed?

Adds a new tdbg schedule audit subcommand under tdbg schedule. The command lists every schedule in the target namespace(s), computes the nominal fire times each spec should have produced inside a user-specified window, queries visibility for the workflows that actually ran, and emits a per-schedule classification of each expected fire:

  • real_miss — expected fire with no matching workflow and nothing else from the schedule running to justify a skip
  • skip_overlap — fire correctly skipped because a prior workflow was still running
  • inconclusive_schedule_changed — schedule spec was modified during the audit window; historical spec can't be recovered
  • unsupported_policy — schedule uses a policy this audit doesn't fully model (BUFFER_ALL, ALLOW_ALL, CANCEL_OTHER, TERMINATE_OTHER, KeepOriginalWorkflowId); surfaced rather than miscounted

Output is either a CSV bundle (summary.csv + per-namespace files) when --output-dir is set, or a single flat CSV stream to stdout when it isn't.
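
The classification described above could be sketched roughly as follows. This is a simplified illustration and not the PR's implementation: the `run` type, the `classify` function, and the extra `fired` label are hypothetical, and the overlap check assumes a SKIP-style overlap policy only.

```go
package main

import (
	"fmt"
	"time"
)

// run is a hypothetical record of one workflow started by the schedule.
type run struct {
	Nominal time.Time // nominal (pre-jitter) time the run was fired for
	Start   time.Time // actual start time
	Close   time.Time // zero value means the run is still open
}

// classify labels a single expected fire. "fired" is an illustrative label
// for the matched case; real_miss and skip_overlap mirror the PR's buckets.
func classify(expected time.Time, runs []run) string {
	for _, r := range runs {
		if r.Nominal.Equal(expected) {
			return "fired"
		}
	}
	for _, r := range runs {
		started := !r.Start.After(expected)
		open := r.Close.IsZero() || r.Close.After(expected)
		if started && open {
			return "skip_overlap" // a prior run was still in progress
		}
	}
	return "real_miss"
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	runs := []run{{
		Nominal: mustParse("2026-05-19T00:00:00Z"),
		Start:   mustParse("2026-05-19T00:00:01Z"),
		Close:   mustParse("2026-05-19T01:30:00Z"),
	}}
	for _, e := range []string{
		"2026-05-19T00:00:00Z",
		"2026-05-19T01:00:00Z",
		"2026-05-19T02:00:00Z",
	} {
		fmt.Println(e, classify(mustParse(e), runs))
	}
}
```

With an hourly spec and one run lasting 90 minutes, the three expected fires come out as `fired`, `skip_overlap`, and `real_miss` respectively.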

Why?

We've had several incidents where schedules silently missed fires, and answering "did the scheduler fire when it should have, during this window?" required ad-hoc visibility queries per schedule. This tool makes that question answerable in a single command across all schedules in many namespaces in parallel, and produces a machine-readable artifact that can be diffed across days to spot regressions.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

@fretz12 fretz12 marked this pull request as ready for review May 26, 2026 14:35
@fretz12 fretz12 requested a review from a team as a code owner May 26, 2026 14:35
Comment thread tools/tdbg/schedule_audit.go Outdated
// - UnsupportedReason != "" -> reclassify real_miss to unsupported_policy and stamp the reason. These are
// corner-case policy/state configs the algorithm does not model correctly today, so we move the count out of
// the trusted real_miss bucket and surface the row for manual review. Reasons currently detected:
// - keep_original_workflow_id -- all fires share one WorkflowID, collapsing the chain-by-WorkflowID model.
Contributor

https://github.com/temporalio/api/blob/0a2f0c3aff1f58e9cec5877d2896bdff5985431d/temporal/api/schedule/v1/message.proto#L213

this is in the API but not currently supported. It won't be for quite some time.

Contributor Author

gotcha, changed comment

@chaptersix
Contributor

I think input and output should be JSON/JSONL. We have jq available in our environments to process input and output.
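
A minimal sketch of what JSONL output for this suggestion might look like, one JSON object per line. The `auditRow` type and its field names are hypothetical and are not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// auditRow is a hypothetical output record; the field names are illustrative.
type auditRow struct {
	Namespace      string `json:"namespace"`
	ScheduleID     string `json:"schedule_id"`
	ExpectedFire   string `json:"expected_fire"`
	Classification string `json:"classification"`
}

// encodeRow renders one row as a single JSON line without a trailing newline.
func encodeRow(r auditRow) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	rows := []auditRow{
		{"my-ns", "my-schedule", "2026-05-19T00:00:00Z", "real_miss"},
		{"my-ns", "my-schedule", "2026-05-19T01:00:00Z", "skip_overlap"},
	}
	for _, r := range rows {
		line, err := encodeRow(r)
		if err != nil {
			panic(err)
		}
		fmt.Println(line) // one object per line, which is the JSONL convention
	}
}
```

Assuming rows shaped like this, a consumer could filter them with something like `jq 'select(.classification == "real_miss")'`.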

NamespaceConcurrency int
}

func parseAuditInputs(c *cli.Context) (*auditInputs, error) {
Contributor

could we support unix pipes instead of specifying an input file? the piped input could contain namespace and schedule id (optional).

that way someone can stream the input from another process without writing a file, and they can cat a file into stdin.

Contributor Author

yup, that was on my todo list.

FILE FORMAT (for --file / stdin)
  One audit target per line as 'namespace[,schedule_id]'. Examples:
    checker-ses-northwest-prod.90d0d
    datastore-northwest-prod.90d0d,DeleteExpiredSecretsScheduledWorkflow--one
    synthetics-northwest-prod.90d0d,my-schedule
  Lines starting with '#' and blank lines are ignored. Schedule IDs must not contain commas.

EXAMPLES
  Single namespace, 1-day window, write CSV bundle:
    tdbg schedule audit --namespace my-ns --start 2026-05-19T00:00:00Z --end 2026-05-20T00:00:00Z --output-dir ./audit-out

  Many targets from a file:
    tdbg schedule audit -f ./targets.csv --start 2026-05-01T19:30:00Z --end 2026-05-02T10:00:00Z \
      --output-dir ./audit-out

  Pipe targets from stdin (cat, psql, awk, etc.):
    cat ./targets.csv | tdbg schedule audit -f - --start 2026-05-01T00:00:00Z --end 2026-05-02T00:00:00Z
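
For illustration, the 'namespace[,schedule_id]' format in the help text above could be parsed along these lines. This is a sketch under the rules stated in the help text (comments, blank lines, no commas in schedule IDs); `auditTarget`, `parseTargetLine`, and `parseTargets` are hypothetical names, not the PR's code.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// auditTarget is a hypothetical parsed line of the target file.
type auditTarget struct {
	Namespace  string
	ScheduleID string // empty means every schedule in the namespace
}

// parseTargetLine parses one line. ok is false for blank lines and comments.
func parseTargetLine(line string) (tgt auditTarget, ok bool, err error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return auditTarget{}, false, nil
	}
	ns, id, _ := strings.Cut(line, ",")
	ns, id = strings.TrimSpace(ns), strings.TrimSpace(id)
	if ns == "" {
		return auditTarget{}, false, fmt.Errorf("missing namespace in %q", line)
	}
	if strings.Contains(id, ",") {
		return auditTarget{}, false, fmt.Errorf("schedule ID must not contain commas: %q", line)
	}
	return auditTarget{Namespace: ns, ScheduleID: id}, true, nil
}

// parseTargets reads targets from any reader, such as a file or os.Stdin.
func parseTargets(r io.Reader) ([]auditTarget, error) {
	var out []auditTarget
	sc := bufio.NewScanner(r)
	for lineNo := 1; sc.Scan(); lineNo++ {
		tgt, ok, err := parseTargetLine(sc.Text())
		if err != nil {
			return nil, fmt.Errorf("line %d: %w", lineNo, err)
		}
		if ok {
			out = append(out, tgt)
		}
	}
	return out, sc.Err()
}

func main() {
	input := "# targets\nmy-ns\n\nother-ns,my-schedule\n"
	targets, err := parseTargets(strings.NewReader(input))
	if err != nil {
		panic(err)
	}
	for _, tgt := range targets {
		fmt.Printf("%q %q\n", tgt.Namespace, tgt.ScheduleID)
	}
}
```

Taking an `io.Reader` keeps the file and stdin paths identical, which is the point of the piping suggestion in this thread.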

Comment thread tools/tdbg/schedule_audit.go Outdated
}
}

// expectedFireTimes returns the nominal (pre-jitter) fire times the spec would produce in (start, end]. Uses the
Contributor

post jitter?

Contributor Author

Actually, pre-jitter is intentional: workflows started by the scheduler carry TemporalScheduledStartTime set to the nominal (pre-jitter) time, so the audit needs nominal times to match fired against expected. I found this to be the most reliable approach. I also removed the jitterseed, since it's not really needed and is misleading. Added a comment as well.
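
As a rough illustration of what "nominal" means here, the sketch below enumerates pre-jitter fire times for a plain interval spec in (start, end]. It assumes the interval semantics as I understand them from the Temporal API docs (times of the form epoch + n * interval + phase), ignores calendar specs, exclusions, and jitter entirely, and works at whole-second granularity. `nominalIntervalTimes` is a hypothetical helper, not the PR's `expectedFireTimes`.

```go
package main

import (
	"fmt"
	"time"
)

// nominalIntervalTimes returns the times epoch + n*interval + phase that fall
// in (start, end]. No jitter is applied, so results stay on the nominal grid.
func nominalIntervalTimes(interval, phase time.Duration, start, end time.Time) []time.Time {
	iv := int64(interval / time.Second)
	ph := int64(phase / time.Second)
	if iv <= 0 || ph < 0 || start.Unix() < ph {
		return nil // outside what this sketch handles
	}
	// Smallest grid point strictly after start.
	next := ((start.Unix()-ph)/iv+1)*iv + ph
	var out []time.Time
	for ; next <= end.Unix(); next += iv {
		out = append(out, time.Unix(next, 0).UTC())
	}
	return out
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	start := mustParse("2026-05-19T00:00:00Z")
	end := mustParse("2026-05-19T03:00:00Z")
	for _, t := range nominalIntervalTimes(time.Hour, 15*time.Minute, start, end) {
		fmt.Println(t.Format(time.RFC3339))
	}
}
```

Because each workflow's scheduled start time is nominal, times produced this way can be compared for exact equality instead of within a jitter tolerance.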

Comment thread tools/tdbg/schedule_audit.go Outdated

// maxAuditWindow caps how wide a single audit window can be. Catches typos (e.g. wrong month in --end) and discourages
// expensive multi-day runs that should be chunked into separate invocations.
const maxAuditWindow = 7 * 24 * time.Hour
Contributor

I think this should be overridable. Some namespaces may have schedules that run once a year.

Contributor Author

ah yea good point. Turned it into a flag, and added a warning if the window exceeds 7 days

@fretz12 fretz12 requested a review from chaptersix May 29, 2026 23:21
fretz12 added 3 commits June 1, 2026 11:38
New `tdbg schedule audit` command that detects missed schedule fires by comparing expected fires from each schedule's spec against actual workflow executions in visibility. Reports per-schedule classification (real_miss / skip_overlap / inconclusive_schedule_changed / unsupported_policy) as CSV, either bundled per-namespace to a directory or streamed flat to stdout.
@fretz12 fretz12 force-pushed the fredtzeng/tdbg-schedule-audit branch from d13ec41 to a0f8f3f Compare June 1, 2026 18:38
Comment on lines +163 to +166
if d := in.WindowEnd.Sub(in.WindowStart); d > defaultMaxAuditWindow {
_, _ = fmt.Fprintf(os.Stderr, "warning: window is %s (longer than default cap %s); proportionally slower and more memory-intensive\n",
d.Round(time.Hour), defaultMaxAuditWindow)
}
Contributor

IMO this is a bit too verbose for a debug tool.

Comment on lines +171 to +174
// defaultMaxAuditWindow caps how wide a single audit window can be by default. Catches typos (e.g. wrong month in
// --end) and discourages expensive multi-day runs that should be chunked into separate invocations. Operators can
// raise the cap via --max-window for schedules that fire less frequently (quarterly, yearly).
const defaultMaxAuditWindow = 7 * 24 * time.Hour
Contributor

I think it's a bit redundant to have both a --max-window and a --start-window option, if one can't work without also setting the other; how about dropping the max, and setting a default start/end time that's [now-7d, now]?
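
One possible shape for the defaulting proposed in this comment is sketched below. It is an interpretation, not code from the PR: `resolveWindow` and `defaultWindow` are hypothetical names, nil stands for an unset flag, and anchoring a missing start at end minus 7 days is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// defaultWindow is the assumed default audit span when no start is given.
const defaultWindow = 7 * 24 * time.Hour

// resolveWindow fills in unset bounds: end defaults to now, and start
// defaults to end minus defaultWindow. A nil pointer means "flag not set".
func resolveWindow(start, end *time.Time, now time.Time) (time.Time, time.Time, error) {
	e := now
	if end != nil {
		e = *end
	}
	s := e.Add(-defaultWindow)
	if start != nil {
		s = *start
	}
	if !s.Before(e) {
		return time.Time{}, time.Time{}, fmt.Errorf("start %s must be before end %s",
			s.Format(time.RFC3339), e.Format(time.RFC3339))
	}
	return s, e, nil
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2026-06-01T12:00:00Z")
	s, e, err := resolveWindow(nil, nil, now)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Format(time.RFC3339), e.Format(time.RFC3339))
}
```

With this shape, a yearly schedule is audited by passing an explicit start, so no separate cap flag is needed.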

Comment on lines +412 to +413
// we'd miss those workflows and falsely flag them as real_miss. 24h is generous for the patterns we've observed
// in practice; may need to raise if we encounter BUFFER chains that routinely run longer than a day.
Contributor

I hadn't realized we had data on this already (how long buffer chains could be) - where have we observed this from?

Comment on lines +1005 to +1008
if policies.GetKeepOriginalWorkflowId() {
// All fires share one WorkflowID, breaking our chain-by-WorkflowID model.
reasons = append(reasons, "keep_original_workflow_id")
}
Contributor

Weirdly, this setting has no effect in practice - known issue. We've considered introducing support for it, but you can leave this out for now (particularly since if a customer did have it set, it wouldn't be doing anything for them).

Comment on lines +1010 to +1012
case enumspb.SCHEDULE_OVERLAP_POLICY_BUFFER_ALL:
// Fires can be buffered for arbitrary durations; our delayedFireBuffer of 24h may miss the workflow.
reasons = append(reasons, "overlap_buffer_all")
Contributor

tbh, I don't think this is much riskier in practice than hedging the 24h against BUFFER_ONE, I'd support it.

Comment on lines +1022 to +1023
// real_miss as skip_overlap when a long-running prior fire is active
// at the expected time. Surface for inspection; we don't reclassify
Contributor

I'm not sure I quite follow; for this policy, if we're querying for started workflows based on the StartedByScheduleID SA, I'd expect us to have fired actions regardless of any existing running state. Since the WID is always unique per fire time (since GetKeepOriginalWorkflowId doesn't actually do anything), you can reasonably expect every time in a schedule's allow_all spec to match a fired WF.

Comment on lines +896 to +899
var jitterSecs int64
if j := info.GetSpec().GetJitter(); j != nil {
jitterSecs = int64(j.AsDuration().Seconds())
}
Contributor

think we can drop this since we explicitly ignore jitter during processing

Comment on lines +932 to +935
var jitterSecs int64
if j := spec.GetJitter(); j != nil {
jitterSecs = int64(j.AsDuration().Seconds())
}
Contributor

ditto, drop filling in jitter imo

var rpcErr error
resp, rpcErr = l.client.ListSchedules(ctx, &workflowservice.ListSchedulesRequest{
Namespace: namespace,
MaximumPageSize: 1000,
Contributor

use visibilityPageSize?


for ns, rs := range byNS {
// Sanitize "/" -> "_"
safeName := strings.NewReplacer("/", "_", string(filepath.Separator), "_").Replace(ns)
Contributor

Maybe also replace "."
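
A small sketch of the sanitizer with "." added to the replaced characters. `safeFileName` is a hypothetical helper name; the replacer arguments extend the ones shown in the snippet under review.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeFileName maps a namespace to a file-name-safe string by replacing
// path separators and dots with underscores.
func safeFileName(ns string) string {
	r := strings.NewReplacer(
		"/", "_",
		string(filepath.Separator), "_",
		".", "_",
	)
	return r.Replace(ns)
}

func main() {
	fmt.Println(safeFileName("team/app.prod"))
}
```

One trade-off to keep in mind: after replacing ".", namespaces such as `a.b` and `a_b` map to the same file name, so the per-namespace files may need a collision check.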

3 participants